FOREWORD



This manual has been written with the hope and 
expectation that people who are not familiar with the 
complex statistics of educational experimentation will 
be able to use it in analyzing experimental results in 
a sound fashion. 

The preparation of this manual was originally
supported by the Fund for the Advancement of Education.
The following persons helped the authors in one way or
another in preparing this manual: Mrs. Anne H. Ferris,
Miss Henrietta Gallagher, Dr. Martin Katz, and Dr.
Marjorie Olsen. Acknowledgment is also due to Dr.
Warren G. Findley, who made a number of useful
suggestions on the basis of his use of the experimental
edition of this manual.

Henry S. Dyer 
William B. Schrader 



Educational Testing Service 
Princeton, New Jersey, 1960

ANALYZING THE RESULTS OF AN EDUCATIONAL EXPERIMENT
(Analysis of Covariance) 

Introduction 

The present ferment in American education is producing many new
approaches to instruction: new methods, new curricula, new devices,
new patterns of classroom organization. As a result there is an
increasing urge to find valid means for evaluating these new
approaches by comparing them with one another and with the
conventional ways of doing things. People wish to know whether there
is any measurable difference between the old and the new in the amount
of learning produced in pupils. They wish to know whether the gain
in performance of a group of pupils treated one way differs
significantly from the gain that would have occurred if the same group
had been treated in another way.

Since it is impossible to treat one group in two ways
simultaneously, it is necessary to deal with two or more groups, each
of which is treated differently from the others. A valid comparison
of the gains made by the several groups requires that allowance
be made for initial differences between the groups. The
statistical technique called analysis of covariance is generally
regarded as the most rigorous means for making such adjustments
and furnishing soundly interpretable results.

The method of analysis of covariance is one which makes the
most of the available data and provides a valid interpretation of
what the outcome of the experiment means. Hitherto this method has
been accessible only to those with a sophisticated understanding of
statistical formulas and procedures. The present explanation attempts
to reduce all such complicated procedures to a step-by-step process
that can be handled by anyone with good command of ordinary arithmetic
and some understanding of algebra. The method to be described is
based directly upon an original paper by Gulliksen and Wilks in the
June 1950 issue of Psychometrika.*

The explanation that follows is built around a typical experiment.
The nature of the experiment is described, the data obtained from it
are given in full, and the analysis of the data is worked out in
complete detail. The reader who wishes to use this approach on an
experiment of his own is advised first to study the data carefully and
then to work out each step of the analysis himself, checking his own
results at every stage against those given. Once he is sure he has
mastered the procedure, he may simply substitute the data from his own
experiment for those given here, and then work through the same steps
in analysis.

The Problem

Three classes of a course in chemistry were taught using special
TV lectures and kinescopes. Three similar classes were taught by the
conventional methods. The group taught by television made an average
gain of 11 points on the final test as compared to their scores on a
pretest of chemical knowledge. The group taught by the conventional
method made an average gain of 4 points on the same test. The
experimenter needs to know the answers to three questions before he
can confidently evaluate the results of his experiment. First, he wants to



* Gulliksen, H. & Wilks, S. S. Regression Tests for Several Samples,
Psychometrika, 1950, 15, 91-114.
know whether differences in initial ability between the television and
the conventional group may have accounted for all or part of the
difference in results. Second, he wants to know whether the average
difference is large enough to rule out the possibility that it arose
merely by chance. Finally, and most important, he wants to know how
big the average difference is after allowing for possible differences
in ability of the two groups being compared. This manual provides a
standard plan for getting the answers to all three questions.
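To see why the first question matters, the raw average gains can be recomputed from the group means reported in Table 1 of this manual. The sketch below is nothing more than that arithmetic; the variable names are ours, not the manual's.

```python
# Group means as reported in Table 1 of this manual.
means = {
    "experimental": {"initial": 27.71, "final": 39.08},
    "control": {"initial": 31.30, "final": 35.32},
}

# Raw gain = mean final score minus mean initial score.
gains = {group: round(m["final"] - m["initial"], 2) for group, m in means.items()}

# Difference in starting level (experimental minus control).
initial_gap = round(
    means["experimental"]["initial"] - means["control"]["initial"], 2
)

print(gains)        # raw gains of about 11 and 4 points
print(initial_gap)  # negative: the experimental group started lower
```

The experimental group began about three and a half points below the control group, so a comparison of raw gains alone does not settle how much of the difference is due to the method of teaching.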

Before going further, it may be well to explain what demands this
plan places upon the person who uses it. First, all of the members in
the two groups to be compared must have been given one or more tests
prior to the training. Second, the experiment must have included at
least fifty people in the experimental group and fifty people in the
control group. Third, the person who has already invested a great
deal of time, effort, and money in conducting an experiment must be
willing and able to carry out some fairly tedious though not difficult
arithmetic operations in order to be able to evaluate his results
statistically. Fourth, the person who actually does the statistical
analysis needs a reasonably good knowledge of certain topics in high
school algebra, including the use of logarithms for calculation. Fifth,
the person who takes the primary responsibility for the task of
statistical analysis must be willing to devote several hours of
thoughtful study to the concepts involved in the method, unless he has
had recent study in statistical methods as applied to educational data.

The General Nature of the Analysis 

What is the general nature of the analysis that follows? It works 
on the principle that there is a relationship between the score obtained 
by each student at the beginning of training and the score obtained at the 



* One final comment may be made regarding the computing which is 
required. The work will be very much facilitated if it is done 
on a conventional desk calculating machine. A power-driven 
machine is very advantageous for this purpose. Many schools 
have such machines for the purpose of calculating grade averages 
and other numerical reports. 
end of training. In general, students with the higher scores at the
beginning will tend to have the higher scores at the end, regardless
of whether such students are in the experimental or the control group.
This relationship may probably best be visualized as a "line of relation"
between initial and final scores. Figure 1 shows what such a line of
relation would look like for the group that studied chemistry by TV,
that is, the experimental group. Notice that the relationship is far
from perfect. Some students with low scores on the initial test got
relatively high scores (above the line of relation) on the final test,
while others with high scores on the initial test got relatively low
scores (below the line of relation) on the final test. The vertical
distance between any dot and the line of relation is an "error of
prediction." That is, it shows how much the initial score is in
error in predicting the final score.
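A line of relation and its errors of prediction can be sketched in a few lines of Python. The sketch assumes that the line of relation is the ordinary least-squares regression line of final score on initial score; the six pairs of scores are hypothetical and are not taken from the worked example.

```python
def line_of_relation(initial, final):
    """Least-squares line final = a + b * initial; returns (a, b)."""
    n = len(initial)
    mean_x = sum(initial) / n
    mean_y = sum(final) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(initial, final))
    sxx = sum((x - mean_x) ** 2 for x in initial)
    b = sxy / sxx            # slope (steepness) of the line
    a = mean_y - b * mean_x  # height of the line at initial score 0
    return a, b

# Hypothetical scores for six students.
initial = [10, 20, 30, 40, 50, 60]
final = [22, 25, 41, 44, 60, 63]

a, b = line_of_relation(initial, final)

# Error of prediction: vertical distance from each dot to the line.
errors = [y - (a + b * x) for x, y in zip(initial, final)]
for x, y, e in zip(initial, final, errors):
    side = "above" if e > 0 else "below"
    print(f"initial {x:2d}, final {y:2d}: {abs(e):.2f} points {side} the line")
```

With a least-squares line the errors above and below the line balance out, so their sum is zero; it is their spread, not their total, that the analysis examines.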

A similar line of relation might be drawn between the initial and
final scores of the control group. Figure 2 shows how the two lines of
relation might look when drawn on the same chart.


The first question the experimenter asks is this: Are the errors
of prediction greater as a whole in one group than in the other to an
extent which cannot be attributed to chance? It is possible in some
experiments that the results of the training would greatly lower or
greatly increase the errors of prediction from the pretest. Usually,
this is not the case, so we shall suppose that the experimenter finds
that the errors of estimate in predicting final scores from pretest
scores are no greater for one group than for the other.

He then proceeds to the second question. Is the line of relation
steeper for one group than for the other? For example, it might happen
that students of high initial ability in the experimental group would
gain more than comparable students in the control group as the result
of the kind of training given to the experimental group, while students
of low ability would do worse than comparable students in the control
group. If this happened, one of the lines of relation would be
noticeably steeper than the other. (See Figure 3.) If the difference in
slope proved to be statistically significant, the experimenter would
have to conclude that the relative effects of the training were
different for different levels of ability as measured by the initial test.
Again, however, it is more likely that it would turn out that the
steepness of the two lines would not be significantly different. Let
us assume that it is not different. Now, it is clear that the two
lines for the groups being compared may be regarded as parallel, as
in Figure 2. At this point, the experimenter can ask the third question.
Typically, this is the heart of the results. On the average, has the
television group done better relative to its ability as measured by
the initial test than has the conventionally-trained group? By applying
a third test, let us suppose the experimenter finds that the difference
is indeed statistically significant. He can then determine the typical
amount of difference between the two groups. This difference is the
vertical distance between the two lines of relation and will be uniform
for all levels of initial test performance. These three questions are
essentially hypotheses that will be tested in the course of the analysis.
If the procedure outlined below seems rather elaborate, it is well
to keep in mind the particular pitfalls which this method is designed
to avoid. If the scores used in adjusting for differences between the
two groups do not show the same errors of estimate or the same slopes
in relation to the measure of success in the experiment, the plan
outlined here is especially valuable. Thus, if it should turn out that
a particular method of training was superior for high scoring students
but inferior for low scoring students, the results obtained by simpler
methods would obscure these two results by merging them into a net
result which would depend only on the proportion of able and inferior
students in the particular groups studied. Further advantages of this
plan are (1) that the statistical significance tests are made step by
step as a direct part of the procedure and (2) that allowance is made
for differences in ability at each step of the way.

MANAGING THE STATISTICAL ANALYSIS 

Selecting the Best Predictor * 

If several initial test scores are available as predictors of 
final scores (e.g., pretest, intelligence test, reading comprehension 
test), it is necessary either to decide upon which one to use or to 
find a suitable way to combine information from each. The following 
procedure offers a relatively quick and easy way to develop data to 
aid in making a choice, before beginning the analysis proper.

The first step in this work is to copy data for all the available
initial tests on the answer sheets of the final test, being careful to
label each score with the designation of the test. Make up a combined
set of final test answer sheets including the same number of papers
from the experimental as from the control group. (If the number of
papers differs for the two groups, eliminate papers at random from
the larger group.) From this set of papers, choose the half which
shows the highest score on the final test. (It may happen that the
lowest score in the top half and the highest score in the bottom half
are the same. Simply assign papers showing this score to the two
groups at random.) Place a distinctive mark (a red check mark) on each
paper in the top half. Now, using all the papers, sort them into an
upper and a lower half on Initial Test A. Count the number of students
in the upper half on Initial Test A who are also in the top half on the
final test (that is, whose papers carry the check mark). A similar
routine should be followed for each of the other tests. The initial
test which has the most candidates who score in the top half on it and
who are also in the top half in final score is the best predictor of
the final score.
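The sorting and counting routine above can be sketched as follows. The scores are hypothetical, the function names are ours, and ties at the dividing line are broken by student number so that the result is repeatable (the manual breaks them at random).

```python
def top_half(scores):
    """Set of student numbers in the upper half on `scores`.

    `scores` maps student number -> score. Ties at the cut are broken
    by student number here; the manual assigns them at random.
    """
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return set(ranked[: len(ranked) // 2])

def overlap_with_final(predictor, final):
    """Count students in the top half on both the predictor and the final test."""
    return len(top_half(predictor) & top_half(final))

# Hypothetical scores for six students.
final = {1: 50, 2: 44, 3: 41, 4: 35, 5: 30, 6: 22}
initial_tests = {
    "Test A": {1: 30, 2: 28, 3: 27, 4: 15, 5: 12, 6: 10},
    "Test B": {1: 40, 2: 20, 3: 18, 4: 36, 5: 37, 6: 15},
}

counts = {name: overlap_with_final(scores, final)
          for name, scores in initial_tests.items()}
best = max(counts, key=counts.get)
print(counts, "-> best predictor:", best)
```

The set returned by `top_half(final)` plays the part of the red check marks: it is computed once from the final test and then compared with the upper half on each initial test in turn.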




* This section may be skipped if only one predictor is to be used in
the experiment.



The extension of the foregoing idea to the combination of
predictors is direct. For example, the students may be divided into
an upper and a lower half on the basis of the sum of scores on two
predictors. If the number of students in both the top half on the
sum of the two predictor scores and the top half on the final test
scores is larger than when the predictors are used separately, then
the combination scores should be used as the predictor in the main
study. A more elaborate approach would give more weight to one
predictor than the other in arriving at a combined total. Here again,
however, the evaluation of the effectiveness of the combination
would follow the procedure described above.

The Data from the TV Experiment in Chemistry Instruction

On the next two pages (10 and 11) are the data from the TV
experiment that we are using as a basis for demonstrating the method
of analysis. In this case the "initial score," which will be used
for predicting the "final score," happens to be the score on Form A
of an achievement test in chemistry. The "final score" is the score
on Form B of the same achievement test.
DATA USED IN WORKED EXAMPLE
Experimental Group

(-- : entry not legible)

           Initial   Final               Initial   Final
Student     Score    Score    Student     Score    Score

    1         13       27        41         18       30
    2         33       38        42         30       51
    3         51       50        43         21       30
    4         40       39        44         19       43
    5         20       46        45         34       53
    6         16       41        46         17       32
    7         --       39        47         25       40
    8         18       35        48         28       28
    9         19       17        49         36       61
   10         36       20        50         43       31
   11         35       50        51         17       42
   12         --       38        52         41       57
   13         30       46        53         27       37
   14         42       36        54         31       42
   15         21       39        55         22       35
   16         31       61        56         28       28
   17         38       40        57         42       48
   18         18       38        58         31       37
   19         26       52        59         27       35
   20         25       29        60         23       26
   21         25       35        61         17       35
   22         --       26        62         21       35
   23         33       46        63         33       41
   24         32       42        64         30       44
   25         27       42        65         18       36
   26         35       40        66         28       45
   27         --       28        67         18       24
   28         --       --        68         --       --
   29         --       --        69         --       --
   30         13       22        70         20       23
   31         27       29        71         34       42
   32         37       56        72         39       58
   33         40       61        73         25       20
   34         37       37        74         39       60
   35         36       55        75         37       53
   36         14       22
   37         23       32
   38         31       42
   39         26       41
   40         32       65
DATA USED IN WORKED EXAMPLE
Control Group

(-- : entry not legible)

           Initial   Final               Initial   Final
Student     Score    Score    Student     Score    Score

    1         26       34        41         31       42
    2         12       16        42         44       40
    3         27       25        43         17       27
    4         28       34        44         18       23
    5         30       41        45         31       33
    6         14       14        46         20       30
    7         17       27        47         40       43
    8         34       --        48         20       35
    9         27       25        49         30       16
   10         55       --        50         32       35
   11         23       35        51         42       40
   12         25       29        52         47       40
   13         26       41        53         47       31
   14         18       38        54         42       31
   15         43       47        55         41       47
   16         31       30        56         21       24
   17         33       35        57         23       33
   18         18       24        58         38       48
   19         37       --        59         50       55
   20         16       20        60         32       40
   21         40       44        61         26       38
   22         21       20        62         21       29
   23         39       52        63         25       19
   24         29       40        64         32       34
   25         30       35        65         40       51
   26         13       11        66         42       56
   27         23       33        67         40       33
   28         28       28        68         --       --
   29         43       35        69         28       38
   30         25       37        70         32       53
   31         37       42        71         32       29
   32         26       37        72         18       30
   33         28       30        73         50       47
   34         47       40        74         51       57
   35         52       57
   36         26       15
   37         32       37
   38         16        9
   39         31       37
   40         36       46
The Analysis

The worksheets provided with this manual specify, step by step,
all the calculations needed to evaluate the three hypotheses described
above under "The General Nature of the Analysis" and to make a simple
graphic presentation of the results. Each worksheet is organized to
accomplish the computation of a particular set of basic figures. Thus,
Worksheet A is for computing the necessary variances. In these
worksheets, figures which are to be copied later are marked by an
asterisk. The space into which they are to be copied gives the source
from which the number is to be copied: the worksheet (by letter) and
the line (by number); e.g., "A-3" means line 3 on Worksheet A. It is a
matter of the greatest importance that the work be carefully checked.
The best method would require re-doing the entire analysis and
comparing the results. In any case, however, all copying should be
carefully checked because it happens all too often, even with skilled
computers, that errors in copying figures occur.

One question in computing arises in deciding how many decimal
places to carry. Four decimal places should insure adequate precision
in the final results and it is recommended that the work be carried
to that number of places. The slightly added work will be more than
repaid in the added confidence you will have in the final results.
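The kind of basic figures a worksheet accumulates, and the habit of checking them a second way, can be sketched as below. This is an illustration only: the five pairs of scores are hypothetical, the layout does not reproduce Worksheet A, and the use of N (rather than N - 1) as the divisor is our assumption.

```python
def basic_sums(x, y):
    """Raw sums of the kind accumulated on a desk calculating machine."""
    return {
        "n": len(x),
        "sum_x": sum(x),
        "sum_y": sum(y),
        "sum_xx": sum(v * v for v in x),
        "sum_yy": sum(v * v for v in y),
        "sum_xy": sum(a * b for a, b in zip(x, y)),
    }

def moments(s):
    """Variances and covariance from the raw sums (divisor N, an assumption)."""
    n = s["n"]
    return {
        "var_x": (n * s["sum_xx"] - s["sum_x"] ** 2) / n ** 2,
        "var_y": (n * s["sum_yy"] - s["sum_y"] ** 2) / n ** 2,
        "cov_xy": (n * s["sum_xy"] - s["sum_x"] * s["sum_y"]) / n ** 2,
    }

# Hypothetical initial (x) and final (y) scores for five students.
x = [13, 33, 40, 20, 36]
y = [27, 38, 39, 30, 46]

s = basic_sums(x, y)
m = {name: round(value, 4) for name, value in moments(s).items()}  # four places
print(m)

# Independent check: recompute one figure from deviations about the mean.
mean_x = s["sum_x"] / s["n"]
check_var_x = round(sum((v - mean_x) ** 2 for v in x) / s["n"], 4)
assert check_var_x == m["var_x"]
```

Computing the same figure by two routes, as in the last three lines, is a small-scale version of re-doing the analysis and comparing the results.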
Now we are ready to see what the outcome of the experiment is.
Here is the way we go about it.

Hypothesis A: Do the errors of prediction in the experimental group
differ significantly from those in the control group?

(1) If the value on Worksheet F, line 69, equals or exceeds 2.882,
then the errors of prediction for the experimental group differ from the
errors of prediction for the control group at the 1 per cent level of
confidence. This means that there is less than one chance in 100 that
such a difference would arise by chance. Such a difference is regarded
as "very significant."

(2) If the value on Worksheet F, line 69, equals or exceeds 1.668,
but is less than 2.882, then the errors of prediction for the experimental
group differ from the errors of prediction for the control group at the
5 per cent level of confidence. This means that there is less than five
chances in 100, but more than one chance in 100, that such a difference
would arise by chance. Such a difference is regarded as "significant."

If the difference found in testing Hypothesis A is either significant
or very significant, one should conclude that the results of the
experiment are indeterminate. Nothing more can be said.

If, however, the difference found in testing Hypothesis A is not
significant, one should then look at the outcome for Hypothesis B.

In the case of the present experiment, the difference found in
testing Hypothesis A is 1.4383 (Worksheet F, line 69). (The minus sign
should be disregarded.) This means that the difference found in testing
Hypothesis A is not significant. Therefore we move on to test Hypothesis
B.
Hypothesis B: Do the slopes of the two lines of relation differ
significantly?

(1) If the value on Worksheet G, line 75, equals or exceeds 2.882,
then the difference between the slopes of the lines of relation is
very significant as defined above.

(2) If the value on Worksheet G, line 75, equals or exceeds 1.668,
but is less than 2.882, then the difference between the slopes of the
lines of relation is significant.

If the difference found in testing Hypothesis B turns out to be
significant or very significant, one concludes that the effects of
instruction differ for students of different ability. One cannot
make any general statement about any general difference between the
experimental and control groups.

If, however, the difference found in testing Hypothesis B is not
significant, then one should look at the outcome for Hypothesis C.

In the case of the present experiment the difference found in
testing Hypothesis B is .1237 (Worksheet G, line 75, disregarding the
negative sign). This means that the difference between the slopes of
the lines of relation is not significant. Therefore, we move on to
test Hypothesis C, the pay-off hypothesis.

Hypothesis C: Is the distance between the two lines of relation
significantly greater than zero?

(1) If the value on Worksheet G, line 80, equals or exceeds 2.882,
then the overall difference between the experimental and control groups
is very significant.

(2) If the value on Worksheet G, line 80, equals or exceeds 1.668,
but is less than 2.882, then the overall difference between the
experimental and control groups is significant.

If the difference is either significant or very significant,
one can conclude that the effect of instruction on one group is in
all probability really different from the effect of instruction on
the other group.

In the present experiment the difference found in testing
Hypothesis C is 9.2499 (Worksheet G, line 80, disregarding the
minus sign). This is a very significant difference. This means
that in all probability the different kinds of instruction have
had genuinely different effects.



SUMMARY STATEMENT 



For each of the three hypotheses, the following applies:

LEVEL OF SIGNIFICANCE      MINIMUM CALCULATED VALUE

     1 per cent                     2.882
     5 per cent                     1.668
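The order in which the three hypotheses are examined can be sketched as a short decision routine. The two cut-off values are those of the summary statement; the function names and the values passed in the example call are illustrative only (of the three, only .1237 is taken from the worked example).

```python
VERY_SIGNIFICANT = 2.882  # minimum calculated value, 1 per cent level
SIGNIFICANT = 1.668       # minimum calculated value, 5 per cent level

def classify(value):
    """Label a calculated worksheet value; the sign is disregarded."""
    magnitude = abs(value)
    if magnitude >= VERY_SIGNIFICANT:
        return "very significant"
    if magnitude >= SIGNIFICANT:
        return "significant"
    return "not significant"

def evaluate(value_a, value_b, value_c):
    """Examine Hypotheses A, B, C in order, stopping as the manual directs."""
    if classify(value_a) != "not significant":
        return "indeterminate: errors of prediction differ between groups"
    if classify(value_b) != "not significant":
        return "effects of instruction differ for students of different ability"
    return "overall difference is " + classify(value_c)

# Illustrative values for Worksheet F line 69, Worksheet G lines 75 and 80.
print(evaluate(-1.2, -0.1237, -9.0))
```

Hypothesis C is reached only when both A and B are not significant, so a large value for C is never interpreted on its own.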






Graphing the Final Results 

If the first two hypotheses do not yield "significant" differences,
but the third hypothesis does, a highly effective graphical presentation
of the results is possible. In this case, lines of relation between
the measure of achievement and the predictive measure can be regarded
as parallel. The vertical distance between the two lines can be taken
as an indication of the extent to which one group excels the other. If
either Hypothesis A or Hypothesis B is rejected, the two lines of
relation will not be parallel. It may be useful, nevertheless, to draw
the lines for the two groups.

All the basic calculations needed for drawing the two lines are
included on Worksheet H. In making the graph, it is necessary to lay 
out a vertical and a horizontal scale. The vertical scale should 
begin at a value somewhat smaller than the lowest of the four final 
values shown in the summary table at the foot of Worksheet H and extend 
a bit higher than the highest value. The vertical scale should have 
low scores at the bottom end and high scores at the top end. The 
horizontal scale should begin with a score somewhat lower than the 
lower of the two selected values of the initial measure and extend 
to a score somewhat higher than the higher of the two. On the hori- 
zontal scale, low scores should be placed at the left and high scores 
at the right end of the scale. 

Once the scales have been laid out, it is necessary to plot the
four points given in the summary table on Worksheet H. To locate the
first point, proceed along the horizontal scale until you reach the
value of the initial measure, and then proceed upward until you reach
a point at the same vertical level as the computed value of the final
measure for that point. For example, the first point in Worksheet H
would be plotted by going to the right until 60 is reached, and then
going upward until 64.6 is reached. When the point is located, it
may be marked by a heavy dot. This point will be near the upper
right-hand corner of the graph. Then, the other point for the experimental
group should be located, and marked by a heavy dot. A straight line
drawn through these points is the line of relation for the experimental
group. Exactly the same procedure may be followed for the other line.
Figure 4 shows the results for the worked example.

Note that if neither Hypothesis A (Question 1) nor Hypothesis B
(Question 2) shows a significant difference, the two lines of relation
will be parallel. If, however, there is a significant difference for
either hypothesis, the two lines of relation will not be parallel.
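The four points to be plotted can be sketched from the group means and a common slope. The means below are those of Table 1; the slope of 0.79 is our assumption (it is not quoted in the text, and the actual figure should be read from the worksheets), and the second selected value of the initial measure, 10, is likewise assumed. Only the value 60 appears in the text above.

```python
# Group means from Table 1: (mean initial score, mean final score).
MEANS = {
    "experimental": (27.71, 39.08),
    "control": (31.30, 35.32),
}
COMMON_SLOPE = 0.79    # assumed; take the actual slope from the worksheets
SELECTED_X = (60, 10)  # 60 is used in the text; 10 is an assumed second value

def final_on_line(x, mean_initial, mean_final, slope=COMMON_SLOPE):
    """Height of a group's line of relation at initial score x."""
    return mean_final + slope * (x - mean_initial)

# Two points per group; a straight line through each pair is that group's line.
points = {
    group: [(x, round(final_on_line(x, mx, my), 1)) for x in SELECTED_X]
    for group, (mx, my) in MEANS.items()
}
for group, pts in points.items():
    print(group, pts)
```

Under these assumptions the experimental line passes through about 64.6 at an initial score of 60, which agrees with the point described above, and the two lines stay the same vertical distance apart at both selected values.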

Presenting Final Results in a Table 

At times, it may be convenient to summarize the main statistical 
results of an experiment in a table. Table 1 has been prepared to 
serve as an illustration of a table giving a fairly full account of 
the design and results. The mean scores reported in Table 1 can be 
obtained directly from Worksheet H, lines 81 and 82. 

It must be noted that the summary statement shown in Table 1
should be given only if the tests for both Hypothesis A and Hypothesis
B are not significant and if the test for Hypothesis C is significant.
In that case, the numerical value of the difference reported can
readily be obtained by subtracting the control group value in line 88
from the experimental group value in that line. As a check, the
difference may also be determined by subtracting the control group
value in line 91 from the experimental group value.
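The relation between the group means and the reported difference can be checked with a little arithmetic. The sketch assumes that, with parallel lines, the adjusted difference equals the difference in final means less the common slope times the difference in initial means; the means and the 6.6-point advantage are those of Table 1, and the slope is solved for rather than taken from the worksheets.

```python
exp_initial, exp_final = 27.71, 39.08  # experimental group means, Table 1
ctl_initial, ctl_final = 31.30, 35.32  # control group means, Table 1
reported_advantage = 6.6               # summary line of Table 1

raw_final_diff = exp_final - ctl_final    # unadjusted difference in final means
initial_diff = exp_initial - ctl_initial  # negative: experimental started lower

def adjusted_difference(slope):
    """Vertical distance between two parallel lines of relation."""
    return raw_final_diff - slope * initial_diff

# Common slope implied by the reported advantage (a derived figure, not a
# value copied from the worksheets).
implied_slope = (raw_final_diff - reported_advantage) / initial_diff

print(round(raw_final_diff, 2))  # difference before adjustment
print(round(implied_slope, 2))   # slope consistent with the reported 6.6 points
```

Because the experimental group started lower, the adjustment raises its advantage above the unadjusted difference in final means.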
TABLE 1

RESULTS OF CHEMISTRY TRAINING EXPERIMENT

Groups Studied:

  Experimental:  75 students who were taught chemistry using
                 special TV lectures and kinescopes

  Control:       74 students who were taught chemistry by
                 conventional methods

Measures Used:

  Initial:  Form A of an achievement test in chemistry
  Final:    Form B of an achievement test in chemistry

Mean Scores:

                   Experimental     Control
                      Group          Group

  Initial Test        27.71          31.30
  Final Test          39.08          35.32

Analysis of Covariance:

  A. Equality of errors of estimate:  Not significant
  B. Equality of slopes:              Not significant
  C. Equality of intercepts:          Significant at one per cent level

Summary: The advantage of the experimental group on final scores,
after allowing for differences between the groups on initial score,
was 6.6 points.